Analysis of Student Responses to
Peer-Instruction Conceptual Questions 
Answered Using an Electronic Response 
System: Trends by Gender and Ethnicity 

This descriptive study investigated students’ answers to geoscience conceptual
questions answered using electronic personal response systems. Answer 
patterns were examined to evaluate the peer-instruction pedagogical 
approach in a large general education classroom setting. 


Over the past decade, it has become
apparent that effective learning occurs 
in Science, Technology, Engineering 
and Mathematics (STEM) classrooms 
that use student-centered, active 
approaches that allow interactive 
exchange between and amongst 
students and instructors (American 
Geophysical Union, 1994; National 
Research Council, 1997, 2000; 
National Science Foundation, 1996). 
Such exchanges are facilitated when 
students use electronic personal 
response systems to answer conceptual 
multiple choice questions, called 
conceptests by Mazur (1997) and 
referred to as think-pair-share exercises 
in some disciplines (McTighe & 
Lyman, 1988). Conceptests are 
repetitive measures designed to 
explore student depth of understanding 
(both individual and group), and they 
often include answers with known 
preconceptions. Students consider 
the question and respond individually. 
Crouch and Mazur (2001) suggest that
an initial correct response rate of
35%-70% is optimal for these questions.
Peer instruction is a practice in which 
students work together in pairs and 
small groups to discuss and defend 
their responses (Mazur, 1997), and 
this discussion may be followed by 
a second round of student responses. 
The use of conceptests is formative, 
because they provide timely feedback 
that the instructor and student can use 
to improve their performance. Much 
has been written about the ways in 
which this technique can be used by 
faculty (Cox & Junkin, 2002; Crouch 
& Mazur, 2001; Green, 2003; Hake, 
1998; Mazur, 1997; McConnell, Steer,
Owens & Knight, 2006; Pilzer, 2001;
Rao & DiCarlo, 2000; Sokoloff &
Thornton, 1997). The evidence is 
also compelling that this technique 
improves student learning from a 
course perspective (Crouch & Mazur, 
2001; King & Joshi, 2008; Lasry,
Mazur, & Watkins, 2008; Smith et al.,
2009) and that the technology is well 
received by students (MacGeorge 
et al., 2008a). Less is known about 
the impact this technique has on 
subpopulations of students based on 
gender and race. 

The conceptests used in this study 
were taken from a large database of 
questions for the geosciences that were 
developed by more than 30 geoscience 
faculty members with multiple years 
of experience teaching introductory 
courses in a variety of settings (e.g. 
community college, small 4-year, 
and public universities). Those 
faculty members used their personal 
experiences and a review of the 
published literature to develop lists of 
geoscience concepts that are difficult 
for students to grasp and are discussed 
in most typical introductory geoscience
courses for non-majors. Some of these
concepts include plate tectonics, 
geologic time, the rock cycle, and 
the water cycle. The conceptests were 
generated according to good practices 
for writing multiple choice questions 
(Haladyna, Downing, & Rodriguez,
2002) by focusing on a single concept, 
using simple language or graphics, 
and including 3-4 short answers 
that require few or no calculations. 
The distracters (incorrect answers) 
also include alternative conceptions, 
misconceptions, or incorrect intuitive 
responses. The conceptests probe 
student understanding at various 
cognitive levels and emphasize the 
comprehension and application 
(“understanding” and “applying” 
levels in Anderson and Krathwohl
[2001]) through analysis and evaluation
levels of cognitive processing (Bloom, 
1956). 

This study focuses explicitly 
on conceptual questions at the 
understanding, applying, and
analyzing cognitive levels (Anderson
& Krathwohl, 2001), because these are
the most appropriate levels to assess
using multiple-choice formats. The
questions are posed as text-, diagram-,
or graph-based problems, and they are
similar to questions on the summative 
exams. At the understanding level, 
students demonstrate they are able to 
convert concepts learned as text to an 
illustration or vice versa. Students are 
also asked to compare and contrast 
objects or concepts, select reasons, 
compare solutions, or make predictions
(see Figure 1). At the applying level,
students apply rules or principles to
new situations, use known procedures
to solve problems, or demonstrate
that they know how to do something.
When working at the analyzing
level, students select answers that
explain how something works or
distinguish fact from opinion.
Questions that require students to
scrutinize graphical data or images
are interpreted as analysis questions,
especially if the students have not
previously seen the graph (see Figure 2).

In the landscape pictured, how would the amount of rainfall change at
location X if the mountain eroded down to the dashed line?

a) Rainfall would increase
b) Rainfall would decrease
c) Rainfall would stay the same

Figure 1: Example of a diagram-based, understand-level conceptest
related to the orographic lifting of air.

The graph illustrates how the temperature changed with time for part
of the rock cycle. Which of the following is best represented by the
graph?

a) Sand is lithified to form sandstone
b) Limestone is metamorphosed to form marble
c) Marble is uplifted to Earth’s surface
d) Magma cools to form granite
e) Shale is heated and converted to magma

Figure 2: Example of a graph-based, analysis-level conceptest related
to the rock cycle.


Methods 

The data used for this study 
represent 4712 responses to
conceptests collected from 242 
students enrolled in four earth science 
classes for non-science majors and one 
physical geology class at a community 
college. These classes were taught 
by three instructors, each with 
over five years of teaching 
experience using active 
learning strategies. In addition 
to incorporating conceptests 
using peer instruction (Mazur, 
1997; McConnell, Steer, 
Owens, & Knight, 2006), 
classes were taught using a 
variety of learner-centered 
activities including the use of 
student-manipulated physical 
models (Gilbert & Ireton, 
2002), lecture tutorials (Kortz, Smay, 
& Murray, 2008), and predictive 
demonstrations (Sokoloff & Thornton,
1997). Students earned participation
points for responding to conceptests,
regardless of whether the answers
were correct or incorrect. Three classes
occurred in spring 2008, and two 
classes occurred in fall 2008. 

Table 1: Scoring rubric for student responses to conceptest questions.

Pre-Discussion   Post-Discussion   Score   % of
Answer           Answer                    Responses
Correct          Incorrect         1       5%
Incorrect        Incorrect         2       28%
Incorrect        Correct           3       26%
Correct          Correct           4       41%

Note that, as averaged over all questions, 33% of students recorded
incorrect responses after peer instruction, and the remainder recorded
a correct answer on the second attempt.

This study reports conceptest
response trends for paired answers
from students who answered 10-26
questions each over the course of the
semester. The questions are assumed
to be valid for content since they were
developed by geoscience educators
and have been reviewed for content 
validity by 12 experts across multiple 
institutions. Reliability and validity 
testing was completed for the questions 
using responses collected in spring 
2006 from a large-format, general 
education introductory earth science 
class (155 students). Responses 
were analyzed for predictability, 
construct validity and gender 
reliability assuming a statistically 
normal response distribution. Correct 
response rates for the questions as a 
whole were not gender biased (p > 0.35,
n = 55). Three individual questions
appeared to show bias even after
addition of response data for the same
questions from fall 2005. As a set, the
52 remaining conceptest questions
used in this study met predictive
validity requirements. The percentage
of students correctly responding to 
comprehension-, application- and 
analysis-level questions decreased 
with increasing question cognitive 
level (p<0.0001; 67%, 52% and 36% 
respectively). 

Student responses from conceptests 
answered during lessons that used the 
peer instruction technique were scored 
using a rubric (Table 1). Those scores 
were used to evaluate the efficacy of 
this pedagogical technique for various 
populations (male, female, Caucasian,
and minority). Students in selected
courses completed a 15-question
Geoscience Concept Inventory (GCI) 
test (Libarkin & Anderson, 2005) as an 
independent assessment of geoscience 
conceptual understanding. The GCI 
is a valid and reliable assessment 
designed to assist geoscience faculty 
in evaluating teaching and learning 
(Libarkin & Anderson, 2005). Its 
purpose and design are similar to the 
Force Concept Inventory (Hestenes,
Wells, & Swackhamer, 1992) that is
widely used in physics education.

Student engagement was determined 
by dividing the number of student 
answers to conceptests by the total 
number of questions posed. For 
example, a score of 70% on student 
engagement was recorded by a student 
who answered 70% of the conceptests 
analyzed in the study. These scores 
were a proxy indicator of attendance. 
Average conceptest scores were 
calculated by dividing the number 
of correct answers by the number of 
questions asked, and no deduction 
was made for unanswered questions. 
Individual student response rates for 
each question category (Table 1) were 
calculated by dividing the number of 
responses in a category by the total 
number of questions answered by 
that student. Final course grades and 
post-course GCI scores were also used 
as summative assessments of student 
success. Response data were grouped 
by gender and ethnicity for analyses. 
African American, Asian, Pacific
Islander, and Hispanic students were
combined under the ‘minority’ classification.
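The rubric in Table 1 and the engagement and category-rate calculations described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the sample student are hypothetical, and the study does not describe the software actually used for scoring.

```python
from collections import Counter

# Table 1 rubric: (pre-discussion correct?, post-discussion correct?) -> score
RUBRIC = {
    (True, False): 1,   # correct, then incorrect
    (False, False): 2,  # twice incorrect
    (False, True): 3,   # incorrect, then correct
    (True, True): 4,    # twice correct
}


def score_pair(pre_correct, post_correct):
    """Return the Table 1 score for one paired response."""
    return RUBRIC[(bool(pre_correct), bool(post_correct))]


def engagement(n_answered, n_posed):
    """Share of posed conceptests that the student answered."""
    return n_answered / n_posed


def category_rates(pairs):
    """Share of a student's answered questions falling in each rubric score."""
    counts = Counter(score_pair(pre, post) for pre, post in pairs)
    total = len(pairs)
    return {score: counts.get(score, 0) / total for score in (1, 2, 3, 4)}


# Hypothetical student who answered 5 of 10 posed questions.
pairs = [(True, True), (False, True), (False, False), (True, True), (True, False)]
rates = category_rates(pairs)
student_engagement = engagement(len(pairs), 10)
```

Here the category rates are divided by the number of questions the student answered, as the text specifies, while engagement is divided by the number of questions posed.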


Not all data fields were available
for all students (due to student absence 
during administration of the GCI, 
missing self-reported data, failure 
to complete the course, etc.). In all, 
five variables (pre-GCI, post-GCI, 
final grades, average proportion of 
correct answers on conceptests, and 
engagement) were analyzed for each 
of the four populations (minority 
male, minority female, Caucasian 
male, Caucasian female). Pearson’s
correlation coefficients (r) were
calculated for the 20x20 matrix,
with values of 0.1-0.3 considered of
small significance, 0.3-0.5 moderate,
and 0.5-1.0 large. Comparisons
between larger populations (male-female,
minority-Caucasian) were also
completed using ANOVA or statistical
t-tests, with Cohen’s d values for
effect sizes, and values of p < 0.05 were
considered significant.
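As a rough illustration of the two statistics named above, the sketch below computes Pearson’s r and Cohen’s d with the Python standard library and labels r using the size bands given in the text. Several details are assumptions rather than the authors’ stated method: the pooled-standard-deviation form of d, the assignment of the shared boundaries (0.3 and 0.5) to the higher band, and the "negligible" label for values under 0.1.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_label(r):
    """Size bands from the text; boundary handling is an assumption."""
    size = abs(r)
    if size >= 0.5:
        return "large"
    if size >= 0.3:
        return "moderate"
    if size >= 0.1:
        return "small"
    return "negligible"  # label not used in the study


def cohens_d(a, b):
    """Cohen's d with a pooled sample standard deviation (assumed variant)."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(pooled_var)


# Hypothetical data: engagement versus final grade for five students.
engagement_scores = [0.5, 0.7, 0.8, 0.9, 1.0]
final_grades = [60, 72, 75, 88, 90]
r = pearson_r(engagement_scores, final_grades)
label = correlation_label(r)
```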

Data 

Data were sorted by both gender and 
race (Figure 3) to show how student 
responses were distributed in the four 
paired-response categories (correct-incorrect,
twice incorrect, incorrect-correct,
twice correct; see Table 1). The
total response database included 6%
minority male (n = 282), 8% minority
female (n = 385), 52% Caucasian male
(n = 2451), and 34% Caucasian female
(n = 1594) responses. 
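The group shares quoted above can be checked against the response counts with a short calculation. The sketch reads the Caucasian male count as 2451, which is the value that makes the four groups sum to the 4712 responses reported in the Methods; the variable names are illustrative only.

```python
# Response counts by demographic group, as reported in the text.
response_counts = {
    "minority male": 282,
    "minority female": 385,
    "Caucasian male": 2451,
    "Caucasian female": 1594,
}

total_responses = sum(response_counts.values())

# Percentage share of each group, rounded to the nearest whole percent.
shares = {
    group: round(100 * n / total_responses)
    for group, n in response_counts.items()
}
```

The counts sum to 4712, and the rounded shares come out at 6%, 8%, 52%, and 34%, matching the reported values.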
Figure 3: Percent of student conceptest responses by demographic group.

Correct-Incorrect: Overall,
approximately 5% of responses 
showed students answered 
conceptest questions correctly the 
first time the question was posed and 
incorrectly on the second attempt 
(Table 1; Figure 3). There were not
enough responses in this answer 
category for meaningful analyses 
between population groups. 

Twice-Incorrect: About 28% of 
all responses were incorrect on both 
attempts (Table 1; Figure 3). As a 
percentage of their responses, male
minority students were most likely
to answer in this way (over 36% of
their responses, Figure 3). Female
students of both demographic
groups answered in this fashion
in approximately equal proportions
(32%), and Caucasian males answered
twice incorrect 26% of the time. 
Comparisons in twice incorrect 
response rates between minority 
populations and for Caucasian females 
compared to other populations showed 
small effect sizes (d = 0.0 to 0.3). 
There were moderate effect sizes when 
comparing male Caucasian response 
rates in the twice incorrect category 
to minority males and females (d = 
0.5 and 0.6). 

Twice Correct: The largest 
differences between populations 
were noted when analyzing the 41% of
twice-correct responses (Table 1: score
4; Figure 3). Caucasian male students
were most likely to answer correctly 
both times (45% of responses). Their 
female counterparts answered in 
this fashion about 39% of the time. 
Female minority students were least 
likely to answer twice correct (26% 
of responses), and minority males 
answered in this way 32% of the time. 
Effect sizes were small to moderate
when comparing female Caucasian
students to males (d = 0.3 for minority
males; 0.4 for Caucasian males) and
when comparing minority males and
females (d = 0.4). Effect sizes were
larger when comparing Caucasian
males to both minority populations (d
= 0.6 for males; 1.3 for females) and
when comparing female populations
(d = 0.9).

Incorrect-Correct: Overall, 26% 
of the responses were incorrect on 
the first attempt, but correct after 
peer instruction (Table 1: score 3). 
At this level, minority females fared
better than their minority male peers 
(35% versus 27% of responses) and 
slightly better than Caucasian students 
(26% for Caucasian females and 24% 
for males). Effect sizes were small 
when comparing minority males to 
both Caucasian populations (d = 0.0 
for males; 0.2 for females). All other 
response rate comparisons in the 
incorrect-correct category displayed 
moderate effect sizes (d = 0.5 to 
0.7). 

Combined Responses: When
average response rates for individual
students by demographic group were
compared to other course variables
(pre- and post-GCI, final grades,
conceptest average, and engagement),
several trends appeared (Table 2).

Male minority students: Minority 
male conceptest averages were 
strongly correlated with post-course 
GCI scores (r = 0.9; Table 2: row D,
column B) and moderately correlated
with final grades (r = 0.5; Table 2: row
D, column C). Pre-course GCI scores
(Table 2: column A) were strongly
correlated to final grades (Table 2:
row C) and post-GCI scores (Table
2: row B) for this population (r = 0.6
and 0.7). Engagement (Table 2: row
E) displayed a moderately negative
correlation with post-GCI scores
(Table 2: column B) and moderately
positive correlations to final grades
and average conceptest scores (r = 0.4;
Table 2: columns C and D). 

Female minority students: Minority 
female average conceptest responses 
(Table 2, row I) displayed a strong 
negative correlation with pre-GCI 
scores (r = -0.6; Table 2: column F).
Final course grades (Table 2: row H)
were moderately correlated (r = 0.5)
to pre-GCI scores (Table 2: column
F) and engagement (Table 2: row J,
column H).

Table 2: Pearson’s correlation coefficients for studied populations
Between populations (Note: Correlations for unrelated variables are removed from the table)

Male Caucasian students: 
Male Caucasian students recorded 
moderately correlated engagement 
and final course grades (r = 0.5; Table
2: row O, column M). Pre-course
GCI scores (Table 2: row L,
column K) were also moderately
correlated to post-GCI results (r = 0.4;
Table 2: column L), as were average
conceptest scores (Table 2: row N).

Female Caucasian students: 
Female Caucasian student data showed 
only one strong correlation (r = 0.6),
and that was between engagement
(Table 2: row T) and final grades
(Table 2: column R). All other
within-group correlations were small or
insignificant.

Between Group Correlations:
Moderately significant correlations
were found when variables were
compared between population groups.
Male and female minority student
pre-GCI scores were correlated (r = 0.5;
Table 2: row F, column A), and male
minority post-GCI scores (Table 2:
column B) were negatively correlated
with post-GCI scores of all other
demographic groups. Pre-GCI scores
for female minority students (Table 2:
column F) were correlated with both
Caucasian males and females (r = 0.4
and 0.5). Other correlations between
groups were either between variables 
that had no practical relationship and 
were not shown (e.g. minority male 
pre-GCI scores and minority female 
post-GCI scores) or were of little or 
no significance. 

Interpretation and 
Discussion 

Correct/Incorrect Responses: We
considered an initial correct response
followed by an incorrect answer choice
to be the least desirable response
sequence. The 5% of responses for
which students answered correctly
initially but changed to an incorrect
response following peer instruction
was similar to the 6% rate reported
by Crouch and Mazur (2001). These
data suggest that such responses should
be expected regardless of ethnicity or
gender (Figure 3). The 5% rate closely
matches a four-answer multiple-choice
question occurrence probability of 6%
for random guessing on two identical
questions (probability increases to
about 10% for a three-answer question).
Since students were awarded credit for 
answering the questions (whether 
correct or incorrect), it is possible that
some students were simply guessing or 
answering randomly to fulfill course 
requirements (King & Joshi, 2008). It 
is also possible some of these responses 
simply represent input error. Such 
an error was possible, because the 
electronic response software provided 
signals when student responses were 
received, but did not display individual
responses. However, ineffectual peer
instruction also cannot be ruled out. If
guessing and input error accounted for
most correct-incorrect responses, those
answers provided little information 
relevant to student assessment or 
teaching. Additional studies are 
necessary to determine if correct- 
incorrect responses are important 
indicators of student learning when 
using this technology. 
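The guessing probabilities quoted in this section (6% for a four-answer question, about 10% for a three-answer question) are consistent with 1/k², the chance that two independent, uniformly random picks among k choices land on one particular ordered pair of answers. That reading is an assumption, since the text does not show the calculation; the function name below is hypothetical.

```python
from fractions import Fraction


def specific_pair_probability(n_choices):
    """Chance that two independent uniform guesses hit one given ordered pair."""
    p = Fraction(1, n_choices)
    return p * p


four_answer = specific_pair_probability(4)   # Fraction(1, 16), about 6%
three_answer = specific_pair_probability(3)  # Fraction(1, 9), about 11%
```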

Correct/Correct Responses: A 
twice-correct answer was considered a 
positive result, because such a response
suggested students initially understood 
the concepts and then validated 
that understanding by answering 
correctly a second time. The overall 
twice-correct answer rates found here 
closely matched the 40% rate reported 
by Crouch and Mazur (2001) and 
played a major role in understanding 
similarities and differences between 
populations. Since Caucasian male 
and female students were more 
likely to answer twice correct, their 
other major answer categories had 
proportionally fewer responses than 
those of minority students (Figure 
3). Such an observation supports the 
contention that differences within 
diverse populations can be more 
important than differences between 
populations (Harper, 2009). When 
student data were sorted into two 
groups (>40% and <40% of responses 
twice correct), there was a strong 
correlation between engagement and 
final grade for both the high- and 
low-performing groups (r = 0.6). Such
a finding was not surprising, because
engagement is a proxy for attendance,
which has been previously correlated 
to course success (Newman-Ford, 
Fitzgibbon, Lloyd & Thomas, 2008; 
Scott, 2000). 

Incorrect/Incorrect: As with the 
correct-incorrect answer, a twice 
incorrect response was considered 
a negative outcome, because it
suggests the peer-learning technique
was not effective for the students 
that answered in this fashion. The 
28% overall response rate for twice- 
incorrect questions was higher than 
was reported for physics (22% in 
Crouch & Mazur, 2001). The finding 
that over one quarter of all responses 
were incorrect after peer discussion 
was particularly troubling in light of 
the fact that 40% of responses were 
twice correct, because this suggests 
that more correct responses result 
from other learning than from peer 
instruction. Students were randomly 
organized into four-person learning 
teams to encourage in-class discussion 
during the peer instruction phase 
of the class. The correct answer for 
most of the conceptests was also the 
most popular answer when students 
were polled on the first attempt. 
Armed with that information and 
group discussion support, such a high 
level of twice incorrect answers was 
considered problematic. ANOVA and
correlation analyses showed that there
were no distinguishable differences
(p > 0.05) and only negligible
correlations (-0.2 <= r <= 0.2) between
students who frequently answered in
this fashion (>25% of registered twice
incorrect responses) as compared to
those who did so less often (<25%) for
all analysis variables. If a significant
number of students in these classes
were not actually discussing answers,
there may have been little propensity
for students to change their answers.
Perhaps students simply failed to 
change answers to questions if they did 
not understand the concepts and dialog 
was not effective enough to clarify 
understanding. Additional research 
that focuses on group interactions 
during peer discussion is necessary to 
determine the extent to which group 
communication affects twice incorrect 
response patterns. 

Incorrect/Correct Responses: 
The type of response sought when 
using conceptests with peer instruction 
was that of changing from an incorrect 
to a correct response (Table 1: score
3). Approximately 26% of student
responses in this study were of this
type, which is lower than the 32%
reported by Crouch and Mazur (2001).
These data suggest that peer instruction
was nearly equally effective for all
populations, but perhaps slightly more
so for female minority students (who
were ~7% more likely than any other
demographic group to answer this
way). Since minority students were
more likely to miss these questions
on the first attempt than Caucasian
students, they were in a better position
to benefit from this approach. 
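One way to make the "better position to benefit" argument concrete is to look only at responses that were incorrect on the first attempt and ask what share became correct after discussion. The sketch below does this with the rounded percentages quoted in the Data section ("over 36%" is taken as 36). It is an illustrative calculation, not an analysis reported in the study, and the name `recovery_rate` is hypothetical.

```python
def recovery_rate(incorrect_correct, twice_incorrect):
    """Share of initially incorrect responses that ended correct."""
    return incorrect_correct / (incorrect_correct + twice_incorrect)


# Overall figures from Table 1: 26% incorrect-correct, 28% twice incorrect.
overall = recovery_rate(26, 28)

# Approximate per-group figures (percent of each group's responses).
by_group = {
    "minority female": recovery_rate(35, 32),
    "minority male": recovery_rate(27, 36),
    "Caucasian female": recovery_rate(26, 32),
    "Caucasian male": recovery_rate(24, 26),
}
```

Because the inputs are rounded percentages rather than raw counts, the resulting rates should be read as rough indications only.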

Overall Responses: Combined 
analyses of all the response data 
suggested that there were similarities 
and differences in the ways that diverse 
populations respond when using this 
technology and pedagogical approach.
When comparing males and females, 
all meaningful variable correlations 
were small or insignificant, which 
supports the suggestion made by 
King and Joshi (2008) that electronic 
response systems did not significantly 
hinder male or female student success 
in engineering. Within the male 
population, moderate correlations 
between pre- versus post-GCI scores, 
engagement versus final course grades,
and conceptest averages versus
post-GCI scores were again identified.
Within the female population,
engagement was strongly correlated
to final grades (r = 0.6), and other
variables correlated poorly. When
the responses of all minority students
were compared to the responses of all
non-minority students, all meaningful
variable correlations related to 
performance were insignificant 
or small, which suggests that this 
pedagogy provided an inclusive 
approach to formative assessment. 
Strong to medium correlations related 
to GCI scores and engagement suggest 
that prior knowledge and attendance 
played the most important role in 
minority students’ course success. 
This finding supports the use of this 
technology with these populations if 
doing so encourages attendance, as 
has been noted in previous studies 
(MacGeorge et al., 2008b). 

All populations could benefit 
if twice incorrect responses were 
minimized. This pedagogy relies on 
the positive group synergies known 
to be generated when learning with 
peers in a low-stakes environment 
(Mazur, 1997). Placing students in
groups working toward a common
goal, as is implicit in peer learning,
provides a pseudo-organizational
structure with social norms. Because
of this, organizational learning theory
(Argyris & Schön, 1996) may be
an appropriate lens through which 
to view student response patterns. 
Central to such learning is the ability 
to detect errors (wrong answers) 
and take appropriate action (select 
correct answers) when responding 
to future opportunities (questions). 
This requires that members of the 
learning team work effectively and 
that the culture of the group be 
conducive to constructive dialog
between all members of the team
(Bensimon, 2005). An environment
that is conducive to constructive
dialog is one in which all students
are comfortable asking questions of 
their group members when they are 
not certain of the correct answer or 
when they consistently answer twice 
incorrect. The social dialog presumed 
to occur during peer instruction 
is known to result in successful 
performance among minority students 
(Quaye, Tambascia, & Talesh, 2009). 
However, the twice incorrect data 
presented here suggest that the optimal 
type of dialog was not occurring as 
often as desired for all populations. 
Clearly, all student groups have high 
and low performers. More detailed 
observations of student discussions 
are needed to better understand the 
dynamics and implications of dialog 
occurring in these groups and the 
impact of those peer discussions on 
response distributions. 

This is the first study to examine 
the contrasts in student performance 
by both gender and ethnicity using 
electronic response systems in large 
classes. Given the ubiquity of this 
technology on college campuses, 
these data are available in electronic 
archives for a wide range of classes. 
We encourage others to analyze their 
data to determine if the trends reported 
here apply more widely. 

Conclusions 

The similarities and differences 
in conceptest response patterns 
found here illustrate how data from 
electronic response systems can 
be used to evaluate a pedagogical 
technique such as peer instruction.
The relatively small percentage of 
correct-to-incorrect responses may 
simply be a function of operator error 
or lack of interest in the class activity. 


As a percentage of all responses 
within populations, males’ and 
females’ answers show very similar 
distributions, which implies that 
the pedagogical technique is gender 
neutral. Furthermore, the distribution 
for answer changes from incorrect to 
correct suggests that all demographic 
groups benefit nearly equally from 
peer discussions. Perhaps as expected, 
students who answer conceptual 
questions correctly the most often tend 
to score highest in course grades, and 
correct response rates are a moderate 
function of prior knowledge and 
attendance. However, the consistently
high rate of twice incorrect answers 
for all groups, and particularly among 
minority males, is cause for concern. 
Better dialog within groups appears 
to be necessary for diverse student 
populations to benefit most effectively 
from this intervention. 